Web Page Quality Estimation Based on Linear Discriminant Function

Authors

  • Rongwei Cen
  • Yiqun Liu
  • Min Zhang
  • Liyun Ru
  • Shaoping Ma
Abstract

As web data grows, estimating web page quality effectively and rapidly is becoming increasingly important for web information retrieval and knowledge discovery. This paper analyzes the differences between retrieval target pages and ordinary pages using query-independent features. Based on these features, an algorithm called Linear Page Estimation (LPE) is proposed for web page quality estimation. In experiments on the .GOV and SOGOU corpora involving 26 million pages, the algorithm filters out about 95% of pages while retaining more than 90% of retrieval target pages. Experimental results on TREC datasets also show that retrieval performance on collections selected by the algorithm can be close to, or even better than, that on the whole collection.
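
To illustrate the general idea of scoring pages with a linear discriminant function and then choosing a cutoff that retains a desired share of retrieval target pages, here is a minimal sketch. It is not the LPE algorithm from the paper: the feature names (`in_links`, `url_depth`, `doc_length`), the weights, the bias, and the toy pages are all hypothetical, invented only for this example. A real system would fit the weights on labeled data.

```python
import math

# Hypothetical query-independent features (assumed, not taken from the paper).
FEATURES = ("in_links", "url_depth", "doc_length")

# Hypothetical weights and bias; in practice these would be learned from labeled pages.
WEIGHTS = {"in_links": 0.8, "url_depth": -0.5, "doc_length": 0.1}
BIAS = 0.0


def score(page):
    """Linear discriminant value g(x) = w . x + b for one page."""
    return BIAS + sum(WEIGHTS[name] * page[name] for name in FEATURES)


def threshold_for_recall(target_scores, recall):
    """Largest threshold that still keeps at least `recall` of the target pages."""
    ranked = sorted(target_scores, reverse=True)
    keep = max(1, math.ceil(recall * len(ranked)))
    return ranked[keep - 1]


def select(pages, threshold):
    """Keep only pages whose score reaches the threshold."""
    return [p for p in pages if score(p) >= threshold]


# Toy collection with invented values; "target" marks retrieval target pages.
pages = [
    {"name": "A", "in_links": 5.0, "url_depth": 1.0, "doc_length": 3.0, "target": True},
    {"name": "B", "in_links": 4.0, "url_depth": 2.0, "doc_length": 2.0, "target": True},
    {"name": "C", "in_links": 1.0, "url_depth": 4.0, "doc_length": 1.0, "target": False},
    {"name": "D", "in_links": 0.0, "url_depth": 5.0, "doc_length": 2.0, "target": False},
    {"name": "E", "in_links": 3.0, "url_depth": 1.0, "doc_length": 1.0, "target": False},
    {"name": "F", "in_links": 0.0, "url_depth": 3.0, "doc_length": 0.0, "target": True},
]

target_scores = [score(p) for p in pages if p["target"]]
threshold = threshold_for_recall(target_scores, recall=0.6)
kept = select(pages, threshold)

# Fraction of the collection removed, and fraction of target pages retained.
reduction = 1 - len(kept) / len(pages)
retained = sum(p["target"] for p in kept) / len(target_scores)
print(f"reduced {reduction:.0%} of pages, retained {retained:.0%} of targets")
```

The cutoff is derived from the scores of known target pages, so the trade-off between collection reduction and target retention can be tuned by changing `recall`.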

Similar resources

Enhanced Model-Based Clustering, Density Estimation, and Discriminant Analysis Software: MCLUST

Abstract: MCLUST is a software package for model-based clustering, density estimation and discriminant analysis interfaced to the S-PLUS commercial software and the R language. It implements parameterized Gaussian hierarchical clustering algorithms and the EM algorithm for parameterized Gaussian mixture models with the possible addition of a Poisson noise term. Also included are functions that ...

Information Quality of Commercial Web Site Home Pages: An Explorative Analysis

In the search for substantive relationships in the use of emerging technology, information quality is often difficult to assess. This research explores user perceptions of presentation, navigation, and quality of Web home pages for approximately 200 selected Fortune 500 companies across 10 industries. An instrument is developed to measure these constructs and is assessed for convergent and disc...

Scott Nicholson - Bibliomining for Automated Collection Development in a Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly Research Work

Nicholson, S. (2003). Bibliomining for automated collection development in a digital library setting: Using data mining to discover web-based scholarly research works. Abstract: This research creates an intelligent agent for automated collection development in a digital library setting. It uses a predictive model based on facets of each Web page to select scholarly works. The criteria came fr...

Direct Multi-label Linear Discriminant Analysis

Multi-label problems arise in different domains such as digital media analysis and description, text categorization, multi-topic web page categorization, image and video annotation etc. Such a situation arises when the data are associated with multiple labels simultaneously. Similar to single label problems, multi label problems also suffer from high dimensionality as multi label data often hap...

Designing a Volunteer Geographic Information-based service for rapid earth quake damages estimation

Introduction: The advent of Web 2.0 enables the users to interact and prepare free unlimited real time data. This advantage leads us to exploit Volunteer Geographic Information (VGI) for real time crisis management. Traditional estimation methods for earthquake damages are expensive and tim...


Publication date: 2007